Adolescent leisure reading and its longitudinal association with prosocial behavior and social adjustment

Abstract
Reading is a popular leisure activity for children, teenagers, and adults. Several theories agree that reading might improve social cognition, but the empirical evidence remains tentative, with research on adolescents especially lacking. We employed a very large, nationally representative longitudinal dataset from the National Educational Panel Study (NEPS) in Germany to examine this hypothesis. Specifically, we tested whether reading prospectively predicted future self-reported prosocial behavior and social adjustment in adolescents, controlling for a number of covariates. Two-way cross-lagged panel analyses probed the longitudinal relationship between leisure reading and these social outcomes from Grade 6 to Grade 9. In addition, we examined the effect of cumulative reading experience across Grades 5–8 on future social outcomes, using structural equation modeling. We also explored the unique contributions of cumulative reading experience in different literary genres (classic literature, popular literature, nonfiction, comic books). Cumulative reading in general did not predict future prosocial behavior and social adjustment. However, cumulative reading of modern classic literature was positively associated with later prosocial behavior and social adjustment.
Protocol registration. The stage 1 protocol for this Registered Report was accepted in principle on 08 November 2021. The protocol, as accepted by the journal, can be found at: 10.17605/OSF.IO/KSWY7.

www.nature.com/scientificreports/ meta-analysis of these experiments found that fiction reading improves social-cognitive performance relative to nonfiction reading or no reading (g = 0.15) 12 . However, the findings from these experiments suffer from several major shortcomings. First, most of these experiments do not assess the durability of these effects, looking only at immediate outcomes in the short term. Thus, it is not clear whether reading narratives leads to improvements that represent meaningful long-term gains in social-cognitive abilities, or just temporary changes that could reflect a kind of priming or social cueing effect. Second, reading a single short story or an excerpt 6,28,29,32,33 , or even an entire book 34 , does not reflect the theoretical accounts of these phenomena 8,9 . Theoretically, the social effects of reading are assumed to emerge from cumulative, repeated exposure to stories over long periods of time 6,8 .
Longitudinal research designs do a far better job of examining cumulative effects like those proposed for reading narrative fiction. Unfortunately, to our knowledge, at the time of this registered report there exist only a handful of longitudinal studies that have looked at this question. Van Schooten and de Glopper investigated the development of literary response (i.e., cognitive responses to texts by readers) in Dutch secondary school students (Grades 7 to 11) 35 . They found that the amount of leisure reading predicted changes in self-reported empathy for fictional characters, from Grade 8 to 9, Grade 10 to 11, and from Grade 9 to 11 (but not from Grade 7 to 8, or Grade 7 to 9). Although certainly encouraging, this study relies entirely on a self-report measure of empathy for fictional characters, and does not examine any real-world social cognition or related behavior. Nor did this study control for any potential third-variable explanations. A second longitudinal study on this topic used a propensity score matching analysis to analyze data from the UK Millennium Cohort Study 36 . These researchers found that reading frequency at the age of 7 predicted children's prosocial behavior at age 11. Importantly, potential third variable explanations were controlled by matching participants on a number of individual, social, familial, and behavioral characteristics. However, they did not find that reading predicted other related outcomes, such as fewer emotional problems, less difficulty with peers, or fewer issues related to conduct. These null findings indicate that either the influence of leisure reading does not extend to these outcomes, or that associations with these outcomes are weak and could not be detected in this study. In addition, this analysis did not examine the potential for differential effects based on genre, such as comparing popular and literary fiction to nonfiction, as in our study.
In sum, prior evidence from cross-sectional correlational and experimental studies suffers from several limitations, and together these studies do not adequately answer whether reading fosters prosocial behavior and social adjustment in the long run. Long-term longitudinal studies that span many years are a better fit for this research question, and the results from prior studies of this type are certainly promising. Our proposed study improves upon and expands the purview of this past work by studying adolescents, controlling for third-variable explanations, and examining possible differential effects based on genre.
Although the SPaCEN framework proposes that all narratives have equal potential to foster social cognition, other theorists propose that this potential is limited to literary fiction 6 . Empirically, however, evidence regarding the importance of literariness is decidedly mixed. In terms of correlational studies, one examination looked at lifetime exposure to different literary genres and whether this exposure predicts mental-inferencing ability 24 . Importantly, these researchers attempted to rule out third variable explanations by controlling for trait personality, gender, age, English fluency, and exposure to nonfiction. After accounting for these variables, and in direct contrast to the proposed importance of literariness, only two forms of popular fiction remained predictors: romance and suspense/thriller. However, Kidd and Castano followed a parallel approach in several distinct samples and found just the opposite: after controlling for a number of covariates (e.g., gender, age) only exposure to literary texts, and not popular fiction, predicted social-cognitive performance 32,37 .
With respect to experiments, Kidd and Castano contrasted literary and popular fiction in several studies 6 . Across their experiments, they found that reading literary fiction resulted in better social-cognitive performance than reading popular fiction (Experiments 2-5) and a no-reading control group (Experiments 2 and 5), whereas popular fiction did not differ from the latter (Experiments 2 and 5). Although these findings have been directly and conceptually replicated 29,38 , there are just as many failures to replicate 28,33 , with a recent replication attempt by the same group reporting mixed findings 32 .
In sum, evidence for the idea that literariness might be crucial for the beneficial effects of narrative literature is inconsistent. These studies also share the limitations previously discussed, as they also rely on correlational and single-session experimental designs. Cross-sectional correlation studies cannot evaluate causal direction and are vulnerable to alternative, third-variable explanations. And single-session experimental examinations of short-term effects do not match the theoretical assumption that frequent and prolonged engagement with stories is important in order to foster social benefits 8 . Both shortcomings underline the need for more long-term longitudinal research on this topic.
The goal of the present study is to extend past research by examining potential longitudinal effects of reading on social-cognitive skills and related behavior, in adolescents and in terms of genre, after controlling for key alternative explanations. To do so, we analyzed German data from the National Educational Panel Study (NEPS). This study collects longitudinal data on educational processes and competencies based on large, nationally representative samples. More precisely, we used data from the starting cohort in Grade 5, which follows students all the way to Grade 9. The NEPS provides data on students' leisure reading habits as well as their prosocial behavior and social adjustment in Grades 6 and 9 (self-reports and parent reports). The structure of this dataset allowed us to examine bidirectional relationships between leisure reading, prosocial behavior, and social adjustment over time, while also controlling for previous levels of prosociality and adjustment. In line with previous research, we made the following predictions (see Table 1):
1. Students' leisure reading at Grade 6 will predict future prosocial behavior/social adjustment (in Grade 9), even after controlling for earlier prosocial behavior/social adjustment.
2. Leisure reading for a past time period (Grades 5-8) will cumulatively predict future prosocial behavior/social adjustment (Grade 9), even after controlling for earlier prosocial behavior/social adjustment.
Finally, the NEPS also provides information on exposure to different genres (e.g., classic literature, popular literature, comic books, nonfiction), allowing us to examine how genre relates to social behavior over time. Two alternative hypotheses may be derived from the research literature here. According to theories such as the SPaCEN framework 8 , the narrativity of reading materials should drive potential effects on prosocial behavior and social adjustment. Therefore, these theories would predict positive associations with social outcomes for exposure to classical literature, popular literature, and comic books, but not for nonfiction. In contrast, other theories assume that the literariness of reading materials should drive the effects 6,32,37 . These theories predict positive social outcomes for exposure to classical literature, but not for the other genres. In line with the SPaCEN framework, we made the following prediction (see Table 1):
3. Leisure reading of narrative literature for a past time period (Grades 5, 7, and 8) will cumulatively predict future prosocial behavior/social adjustment (Grade 9), even after controlling for earlier prosocial behavior/social adjustment.
Moreover, we examined whether these relationships remained after controlling for non-verbal intelligence, socioeconomic status, and migration background 25,31 . The latter is important to control for, as children with a migration background may be delayed with respect to German language comprehension, which impacts leisure reading. Moreover, children with a migration background in Germany show higher levels of internalizing problems 39,40 , which could take the form of worse prosocial behavior or social adjustment.

Results
Measurement invariance testing of leisure reading, self-reported prosocial behavior, and peer problems. We tested the SDQ scales of Prosocial Behavior and Peer Problems (self-reports) and leisure reading for their measurement invariance between Grades 6 and 9 (Table 2). As imputed data only provide averaged and therefore approximate information on model fit, we used the original non-imputed data for testing longitudinal measurement invariance. For leisure reading, strict measurement invariance was established. For Prosocial Behavior, weak measurement invariance was established. However, an analysis of partial measurement invariance found that strict invariance could be established for three of the five items. For Peer Problems, weak measurement invariance could not be established. An inspection of the items indicated that one item did not show equal factor loadings across time, but for the other four items strict measurement invariance could be established. Accordingly, for Prosocial Behavior and Peer Problems, the models including partial strict measurement invariance were used for further analyses.
Table 2. Tests of Measurement Invariance across Measurement Points for Leisure Reading, Prosocial Behavior, and Peer Problems. Longitudinal measurement invariance testing was conducted with the original non-imputed data. Residual correlations were allowed between identical items across measurement points. To allow for an assessment of weak measurement invariance of leisure reading, the residual correlation between the measurement points for the item on leisure reading during the school days, which was not significant, was set to zero. (a) Thresholds for items i and g were freely estimated. (b) Thresholds and residual variances for items i and g were freely estimated. (c) Residuals of inverted items e and f were allowed to correlate at both measurement points. (d) Factor loading for item h was freely estimated. (e) Factor loading and threshold for item h were freely estimated. (f) Factor loading, threshold, and residual variance for item h were freely estimated.
Cross-lagged relations between leisure reading and self-reported prosocial behavior and peer problems. Cross-lagged panel models indicated that leisure reading at Grade 6 and Grade 9 was positively correlated with concurrent self-reported prosocial behavior (Fig. 1A) and peer problems (Fig. 1B). These correlations were small. However, we found no significant effect of leisure reading on later self-reported prosocial behavior (Fig. 1A) or peer problems (Fig. 1B). In addition, we found a significant effect of prosocial behavior on later leisure reading (Fig. 1A). Thus, our hypotheses were not supported. Additional analyses with students nested in schools and including only those students who remained in the same school from Grades 5 to 9 yielded comparable results (see Fig. 1A and B).
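The logic of a two-wave cross-lagged panel model can be illustrated with a minimal sketch. It is not the authors' analysis: the reported models were latent-variable structural equation models estimated in Mplus with covariates, whereas this sketch swaps in ordinary least squares on observed scores. The function names (`ols`, `cross_lagged`) and the toy data are hypothetical. Each Grade 9 score is regressed on both Grade 6 scores, so the cross-lagged path is the effect of one variable on the other's later level after controlling for that outcome's own earlier level.

```python
from typing import Dict, List, Sequence


def ols(predictors: Sequence[Sequence[float]], outcome: Sequence[float]) -> List[float]:
    """Return [intercept, b1, b2, ...] from ordinary least squares."""
    rows = [[1.0, *row] for row in predictors]
    k = len(rows[0])
    # Normal equations (X'X) b = X'y, stored as an augmented k x (k + 1) matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, outcome))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def cross_lagged(
    read_t1: Sequence[float],
    social_t1: Sequence[float],
    read_t2: Sequence[float],
    social_t2: Sequence[float],
) -> Dict[str, float]:
    """Fit the two regressions of a two-wave cross-lagged panel model."""
    wave1 = list(zip(read_t1, social_t1))
    # Later reading regressed on earlier reading and earlier social outcome.
    _, read_stability, social_to_read = ols(wave1, read_t2)
    # Later social outcome regressed on earlier reading and earlier social outcome.
    _, read_to_social, social_stability = ols(wave1, social_t2)
    return {
        "reading_stability": read_stability,
        "social_stability": social_stability,
        "reading_to_social": read_to_social,
        "social_to_reading": social_to_read,
    }


# Hypothetical toy scores (not NEPS data) for eight students at two waves.
reading_g6 = [1, 2, 3, 4, 5, 2, 4, 1]
prosocial_g6 = [6, 5, 8, 7, 9, 4, 6, 7]
reading_g9 = [2, 2, 4, 4, 5, 3, 4, 2]
prosocial_g9 = [6, 6, 8, 8, 9, 5, 7, 7]
print(cross_lagged(reading_g6, prosocial_g6, reading_g9, prosocial_g9))
```

In this simplified form, the pattern reported above would correspond to a `reading_to_social` path near zero together with a non-zero `social_to_reading` path.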
Cumulative leisure reading and self-reported prosocial behavior and peer problems. Cumulative leisure reading positively predicted later self-reported prosocial behavior when controlling for prior prosocial behavior (Fig. 2A). However, this small effect vanished when the control variables were included. In addition, cumulative leisure reading did not significantly predict later self-reported peer problems (Fig. 2B). Thus, our hypotheses were not supported. Additional analyses with students nested in schools and including only those students who remained in the same school from Grades 5 to 9 yielded comparable results (see Fig. 2A and B).
Cumulative leisure reading across different genres and self-reported prosocial behavior and peer problems. Cumulative reading of nonfiction did not predict later self-reported prosocial behavior (Fig. 3A) or peer problems (Fig. 3B). Among the narrative reading categories, only cumulative reading of modern classic literature predicted more self-reported prosocial behavior and fewer self-reported peer problems when the control variables were included. These effects were small, but statistically significant. However, the similar associations we predicted for popular literature and comic books were not supported. Additional analyses with students nested in schools and including only those students who remained in the same school from Grades 5 to 9 yielded comparable results (see Fig. 3A and B).

Exploratory analyses.
In exploratory analyses that were not pre-registered, we analyzed the parental reports of Prosocial Behavior and Peer Problems following the same procedure as for the self-reports. In addition, we replicated all analyses with an even more conservative methodological approach, including gender, reading competence, and trait openness to experience as additional control variables.
Measurement invariance testing of parent reported prosocial behavior and peer problems. We tested the SDQ scales of Prosocial Behavior and Peer Problems (parental reports) for longitudinal measurement invariance between Grades 6 and 9 (Table 2), using the same procedure as for self-reports. For Prosocial Behavior, strict measurement invariance could be established. For Peer Problems, weak measurement invariance could not be established. An inspection of the items indicated that, similar to the self-reports, one item did not show equal factor loadings across time, but for the other four items strict measurement invariance could be established. Accordingly, for Peer Problems, the model including partial strict measurement invariance was used for further analyses.
Cross-lagged relations between leisure reading and parent reported prosocial behavior and peer problems. Cross-lagged panel models indicated that leisure reading at Grade 6 was again positively correlated with concurrent parent reported peer problems (Fig. 4B). However, we found no significant effect of leisure reading on later parent reported peer problems. In contrast, leisure reading was even, to a small degree, negatively associated with later parent reported prosocial behavior (Fig. 4A). Additional analyses with students nested in schools and including only those students who remained in the same school from Grades 5-9 yielded similar results. An exception was that the association between leisure reading and later prosocial behavior was no longer statistically significant after adding the control variables (see Fig. 4A and B).
Cumulative leisure reading and parent reported prosocial behavior and peer problems. Cumulative leisure reading did not significantly predict later parent reported prosocial behavior (Fig. 5A) or peer problems (Fig. 5B). Additional analyses with students nested in schools and including only those students who remained in the same school from Grades 5 to 9 yielded comparable results (see Fig. 5A and B).
Cumulative leisure reading across different genres and parent reported prosocial behavior and peer problems. Cumulative reading of nonfiction did not predict later parent reported peer problems (Fig. 6B) and was even negatively related to later parent reported prosocial behavior (Fig. 6A). Consistent with the associations for self-reported prosocial behavior and peer problems, among the narrative genres only cumulative reading of modern classic literature predicted greater parent reported prosocial behavior and fewer parent reported peer problems. These statistically significant relations were small in magnitude, and the similar predictions we made for popular literature and comic books were not supported. Additional analyses with students nested in schools and including only those students who remained in the same school from Grades 5-9 yielded largely similar results. However, in these analyses none of the reading categories were significantly related to parent reported prosocial behavior (see Fig. 6A and B).
Replication of all analyses including gender, reading competence, and trait openness to experience as additional control variables. The more conservative methodological approach, in which gender, reading competence, and trait openness to experience were included as additional control variables, replicated the findings of the original analyses, in which only migration background, non-verbal intelligence, and parents' educational background were included as control variables (see supplemental material on OSF: https://doi.org/10.17605/OSF.IO/D97QN).

Discussion
We examined whether leisure reading would predict later prosocial behavior and social adjustment in a large, nationally representative, longitudinal dataset from Germany. Both self-reports (pre-registered) and parental reports (exploratory) were examined. Based on previous theoretical accounts we hypothesized that leisure reading would improve social-cognitive skills and result in positive downstream consequences 8-10 . To examine our hypotheses, we used cross-lagged panel analyses between leisure reading and prosocial behavior/social adjustment from Grades 5 to 9, as well as analyses linking cumulative reading across Grades 5 to 8 to later prosocial behavior/social adjustment in Grade 9. In contrast to our hypotheses, neither leisure reading in Grade 5 nor cumulative leisure reading from Grades 5 to 8 predicted later prosocial behavior or social adjustment, when controlling for prior prosocial behavior/social adjustment and other important control variables (migration background, education, and nonverbal intelligence). Thus, our results do not align with the assumption that reading frequency in general results in a benefit to social cognition and associated social consequences. However, our study does not represent a strict test of this assumption because these archival data do not include measures of social-cognitive skills (e.g., theory-of-mind or empathy), only putative downstream consequences of these skills. Prosocial behavior and social adjustment represent distal consequences of social cognition that are also likely influenced by many other things (e.g., parenting, socialization, sociocultural factors). In many ways, expecting leisure reading to predict these downstream consequences of social cognition is a rather stiff test.
The NEPS data also did not measure the motivation to engage in social processing, and possessing the ability to engage in social cognition is separate from the motivation to do so 41-43 . Even if reading promotes social-cognitive skills, in the absence of the appropriate motivation these downstream consequences would not be observed. Another aspect of our study to note is that many theories propose an effect for reading specific types of text (e.g., fiction, or literary fiction in particular), whereas these main analyses examined leisure reading as a whole. That said, our results do partially replicate findings from the UK Millennium Cohort Study, in which reading frequency at age 7 predicted prosocial behavior but not peer problems at age 11 36 . This study also joins a growing body of longitudinal research examining the relation between reading and social outcomes, which has sometimes supported a longitudinal association between the two 44,45 , and sometimes failed to find such an association 46 .
Because the NEPS data also provide information on leisure reading for different genres (nonfiction, popular literature, modern classic literature, comics), we also examined whether genre is relevant for predicting prosocial behavior and social adjustment. Based on the SPaCEN framework, we predicted that all the narrative categories, but not the nonfiction category, would predict these outcomes 8 . Consistent with this prediction, we found that cumulative reading of nonfiction across Grades 5 to 8 did not predict later prosocial behavior or social adjustment. This is consistent with previous cross-sectional correlational research, in which no associations, and sometimes even negative associations, are reported between nonfiction and social-cognitive skills 26,31 . Also as hypothesized, cumulative reading of modern classic literature predicted later prosocial behavior and fewer peer problems. However, this was not observed for the other two narrative categories: popular literature and comic books. Accordingly, our results are a better fit with Kidd and Castano's notion that only literary fiction, but not popular fiction, improves social-cognitive skills 6,32,37,38 . These authors theorize that literary fiction is unique in its ability to promote social cognition by virtue of "showing" rather than "telling", and presenting complex three-dimensional characters rather than stereotypical and flat ones. Both techniques should prompt readers to engage in mental inferencing, in order to understand what these types of characters are thinking. Texts that are concrete and explicit, and rely on simple and predictable characters, seem less likely to evoke mental inferences. This finding, that literary fiction is unique in predicting social adjustment and prosocial behavior, would seem to contradict a previous correlational study, which found that reading romance fiction is the most robust predictor of social-cognitive abilities in adults 24 .
However, this discrepancy could easily be a function of different populations (i.e., adolescents in these data, rather than adults) and therefore likely different texts constituting popular fiction 24 . It is also important to note that these effects were small (β ≤ 0.10) and are unlikely to be detected with smaller samples, as we expected in our a priori power analysis. As a limitation of the present study, we have to note that school and, in particular, class membership was not stable for a number of students as they progressed from Grade 5 to 9. Accordingly, the pre-registered analytic approach (i.e., students nested in classes and schools) had to be adjusted. We did two complementary analyses, one with the total sample without nesting, and one with a reduced sample with students nested in schools. Both analyses are limited and may result in biased parameter estimates. Importantly, however, both approaches resulted in very similar parameter estimates and led to the same conclusions. This supports the robustness of the observed effects. Cross-classified models are an additional analytic option that might be used to address the problem of non-stable group memberships across time points 47 . However, this would also result in a reduced sample due to missing information on students' group memberships at several time points. In addition, to our knowledge, Mplus cannot yet handle a combination of several cross-classified and nested group memberships, as is required for our data.
Encouragingly, our pattern of findings was largely replicated in our exploratory analyses of the parent reports of prosocial behavior and social adjustment. Moreover, a more conservative analytic approach, controlling for numerous other variables (i.e., gender, students' reading competence, and trait openness to experience) 24,31 , yielded comparable results for both self-reports and parental reports. This indicates that the observed effects, though small in magnitude, are robust.
In sum, our results do not support the assumption that (cumulative) leisure reading in general results in positive downstream consequences associated with better social cognition. However, our results do confirm that reading modern classic literature predicts more prosocial behavior and better social adjustment in the future, with no such beneficial effects observed for nonfiction. This suggests that it is worthwhile to investigate the effects of reading different genres in future research. To do so, a greater number of well-powered longitudinal studies directly examining social-cognitive skills (e.g., theory-of-mind, empathy, emotion recognition, social memory) are clearly needed.
The data protection unit of the LIfBi checked data collection procedures, instruments, and documents. Participation in the NEPS study was voluntary, and participants could withdraw from the study at any time. All participants gave informed consent.

Methods
Our data come from "Starting Cohort Grade 5", which provides data starting in Grade 5 (around 10 years of age) and ending in Grade 12. There were nine assessment waves, with each wave conducted in a different school year or grade (Wave 1 in Grade 5, Wave 2 in Grade 6, etc.). As an exception to the rule, two assessment waves were conducted in close succession in Grade 9 (Waves 5 and 6). Quantity of leisure reading was assessed from Waves 1 to 5 (Grades 5 to 9) and reading frequency for specific literary genres was measured in Wave 1 (Grade 5) and in Waves 3 (Grade 7) through 5 (Grade 9). Students' self-reported prosocial behavior and social adjustment were assessed at two measurement points, Wave 2 in Grade 6 and Wave 6 in Grade 9. All of these data are publicly available upon request from the LIfBi.
Participants. Participants were 6,112 students who were representatively sampled from all children attending the fifth grade in Germany in 2010/2011. Stratified multi-stage sampling based on explicit stratification (schools educating students in Grade 5 and in Grade 9, schools educating students in Grade 5 but not in Grade 9, special-needs schools) and implicit stratification (school type, Federal State, regional classification, funding institution) was employed. Schools were first selected from the strata and then two classes (if available) per school were randomly selected. Data were then collected from entire school classes 50,51 . As students of special-needs schools (570 participants in Grade 5) might differ from other students in a number of ways and were excluded from the panel cohort after Wave 4 (Grade 8), only regular students were included in the present analyses (5,208 participants in Grade 5) 51,52 . Of these students, 48.2% were female (< 1% missing), mean maximum ISCED-97 for parents (a combination of parents' highest school qualification and their last vocational qualification) across all waves (1 to 6) was 6.46 (25.7% missing), and 2,919 participated in all waves (1 to 6) between Grades 5 and 9 51,53 .

Measures. Demographics.
Demographic data were self-reported by students and parents, and included age, gender, migration status, and socioeconomic status. In the current study, migration background was defined as both of the child's parents being born outside of Germany. Socioeconomic status was based on a combination of parents' highest school qualification and their last vocational qualification (ISCED-97), ranging on an eleven-point scale from no school-leaving qualification to doctorate or habilitation 54,55 . Socioeconomic status and migration background were used as control variables in the analyses.
Leisure reading. The overall frequency of leisure reading was assessed with two self-report items, with one item addressing how much leisure reading takes place on normal school days and the other concerning normal nonschool days. Both items were answered on a five-point Likert scale (1 = not at all to 5 = more than 2 h). Quantity of leisure reading was assessed from Wave 1 to 5 (Grade 5 to 9).
Genre. We analyzed genre of leisure reading for the following categories: (1) detective novels, thrillers, horror or fantasy (e.g., Harry Potter or Lord of the Rings); (2) modern classic literature (e.g., by George Orwell or Günter Grass); (3) nonfiction; and (4) comic books. One item was employed for each category and responses were made on a five-point Likert scale (1 = never or seldom to 5 = daily). These data exist for Wave 1 (Grade 5) and Waves 3 (Grade 7) through 5 (Grade 9).
Social-cognitive behavior. As indicators of real-life social-cognitive behavior, two subscales from the Strengths and Difficulties Questionnaire (SDQ) 56 were used: Prosocial Behavior and Peer Problems/Problem Behavior. The former represents prosociality (e.g., "I try to be nice to other people, their feelings are important to me."; "I am ready to help people when they are injured, sick or sad.") and the latter social adjustment to the peer group (e.g., "Most of the time I spend alone; I rather concentrate on myself."; "I have one or several good friends."). The SDQ is a widely used instrument in research and practical settings, for several disciplines such as medicine, psychology, and education. Its popularity is in part evidenced by the fact it has been translated into several other languages and adapted for different cultures 57,58 . For both measures, we had access to both self- and parental reports that were available at two measurement points, Wave 2 (Grade 6) and Wave 6 (Grade 9). For both measures, we used the self-reports for the pre-registered confirmatory part of the manuscript and the parental reports for additional exploratory analyses. Each subscale consists of five items with responses made using a 3-point Likert scale: not applicable (0), partly applicable (1), clearly applicable (2). For each scale, the sum score ranges between 0 and 10 points. A recent review reports sufficient test-retest reliability for the self-report form of both scales (0.60 and 0.63) 57 . To allow direct comparisons with previous research, we used the self-report version of the SDQ. Another advantage of using the self-report data, instead of the parental report, is that parental reports are unlikely to accurately reflect experiences encountered when students are in school and away from home (unless students tell their parents everything that occurs at school, in relation to prosocial behavior or peer problems).
This lack of insight into student experiences while at school may explain why the parental version is only moderately correlated with the self-report version 57 .
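The scoring rule described above (five items, each coded 0 to 2, summed to a subscale score between 0 and 10) can be sketched as follows. This is a minimal illustration and not the NEPS scoring routine: the function name `sdq_subscale_score` is hypothetical, and any reverse-keying of individual items is assumed to have been applied before the responses are passed in.

```python
def sdq_subscale_score(item_responses):
    """Sum five SDQ item responses (each 0, 1, or 2) into a 0-10 subscale score.

    Assumption: reverse-keyed items have already been recoded by the caller.
    """
    if len(item_responses) != 5:
        raise ValueError("each SDQ subscale has exactly five items")
    if any(r not in (0, 1, 2) for r in item_responses):
        raise ValueError("responses must be 0, 1, or 2")
    return sum(item_responses)
```

For example, a respondent answering "clearly applicable" to every item of a subscale would receive the maximum score of 10.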
Non-verbal intelligence. Our control variable of non-verbal intelligence was measured using a matrices test constructed by the NEPS research group (NEPS-MAT), which assesses reasoning 59 . Each of 12 items consists of several horizontally and vertically arranged fields, in which different geometric elements are depicted with one field left empty. The logical rules on which the pattern of shapes is based must be deduced in order to select the right complement for the empty field (akin to Raven's progressive matrices 60 ). The scores range from 0 to 12. The test was administered in Wave 1 (Grade 5) and in Wave 6 (Grade 9). We used the score in Grade 5 as a control variable in the analyses.
Analyses. All of our analyses were preregistered on AsPredicted (see https://aspredicted.org/8gf5b.pdf) before requesting access to the NEPS data (i.e., we had no access to the data before the registered report had been accepted). We had no knowledge of this dataset other than what is reported in the publicly available documents (e.g., codebook, information on survey instruments; see https://www.neps-data.de/Data-Center/Data-and-Documentation/Start-Cohort-Grade-5/Documentation), which provided no insight into our research questions. The NEPS data are available for public use and can be requested from the Leibniz Institute for Educational Trajectories (LIfBi).
Data handling and analysis. General procedure for structural equation modeling. The main analyses were conducted with Mplus (version 8.8) 61 . Our analysis script is publicly available on OSF. As the data are publicly available, all our analyses are replicable in conjunction with this script. Categorical variables were dummy coded. For detecting outliers, we checked for univariate (Median Absolute Deviation > 3) and multivariate outliers (Mahalanobis-Minimum Covariance Determinant with p = 0.001) with the R package Routliers 62 . As the NEPS data are corrected for implausible values on individual variables and scores, only multivariate outliers (n = 56, resulting in a final sample N = 5152) were removed before the analyses were conducted. To address nonindependence of observations (due to students being nested in classes and schools), we had initially proposed to conduct our analyses with the Mplus option type set as complex twolevel 61 . However, as class attendance varied for many students from Grade 5 to Grade 9, and the school attended also changed for a considerable portion of the sample, we conducted all analyses without nesting. To account for this change in procedure, we additionally performed all analyses with the students who remained in the same school (N = 3387, with multivariate outliers removed N = 3344), in which we included the nesting in schools (Mplus option type set as complex). The code and results of these additional analyses are also publicly available on OSF. This was a suitable approach because we were interested in relations between variables at the student level and did not use any predictors at the class or school level. Accordingly, we set the estimator to WLSMV (instead of MLR), which is the default for ordinal variables and which provides information on model fit.
Concomitant with the change of the estimator, we used multiple imputation (with covariates included; 40 imputations 63 ) to address missing data (instead of FIML, which is used with MLR).
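The univariate outlier check mentioned above (Median Absolute Deviation > 3) can be illustrated with a short sketch. This is not the Routliers implementation: the function name `mad_outliers` is hypothetical, and the scaling constant 1.4826 (which makes the MAD comparable to a standard deviation under normality) is an assumption of this sketch. The multivariate Mahalanobis-MCD check is not covered here.

```python
from statistics import median

def mad_outliers(values, threshold=3.0, scale=1.4826):
    """Flag values lying more than `threshold` scaled MADs from the median.

    `scale` is an assumed consistency constant, not taken from the paper.
    """
    med = median(values)
    abs_devs = [abs(v - med) for v in values]
    mad = scale * median(abs_devs)
    if mad == 0:
        # No spread around the median: nothing can be flagged meaningfully.
        return [False] * len(values)
    return [dev / mad > threshold for dev in abs_devs]

# Invented example data: only the last value is far from the rest.
flags = mad_outliers([10, 11, 12, 11, 10, 12, 11, 50])
```

Because the median and MAD are themselves robust to extreme values, a single aberrant score does not mask its own detection, which is the usual reason for preferring this rule over one based on the mean and standard deviation.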
Model fit and model comparison. As indicators of model fit, the comparative fit index (CFI), the Tucker-Lewis index (TLI), the root mean square error of approximation (RMSEA), and the standardized root mean square residual (SRMR) were reported. Values of 0.95 or higher for the CFI and TLI, lower than 0.06 for the RMSEA, and lower than 0.08 for the SRMR are indicative of good model fit 64 . We also considered the statistical significance of the Chi-Square test in model evaluation. However, given the large sample size, this test has extremely high power, so a statistically significant Chi-Square test per se is not indicative of bad model fit. To assess the importance of individual paths, we examined whether the paths were statistically significant and whether model fit decreased when a path was dropped from the model. In addition to an inspection of the model fit indices (CFI, TLI, RMSEA, SRMR), Chi-Square difference tests were calculated to compare nested models. Statistically significant Chi-Square difference tests or CFI, TLI, RMSEA, and SRMR values that no longer indicate good model fit were treated as indicators that model fit had significantly worsened. Further interpretation of the importance of statistically significant individual paths was guided by the size of the path coefficient in the fully standardized solution.
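The cutoffs quoted above can be written as a simple check. This is only a sketch of the decision rule stated in the text; the function names `fit_checks` and `indicates_good_fit` are hypothetical, and the input values in the usage note are invented for illustration rather than taken from the study.

```python
def fit_checks(cfi, tli, rmsea, srmr):
    """Compare each fit index against the cutoff quoted in the text."""
    return {
        "CFI": cfi >= 0.95,    # 0.95 or higher
        "TLI": tli >= 0.95,    # 0.95 or higher
        "RMSEA": rmsea < 0.06, # lower than 0.06
        "SRMR": srmr < 0.08,   # lower than 0.08
    }

def indicates_good_fit(cfi, tli, rmsea, srmr):
    """True only if all four indices meet their cutoffs."""
    return all(fit_checks(cfi, tli, rmsea, srmr).values())
```

With invented values such as CFI = 0.97, TLI = 0.96, RMSEA = 0.07, and SRMR = 0.05, `fit_checks` would mark only the RMSEA as failing its cutoff, so `indicates_good_fit` would return `False`.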
Latent variables and measurement invariance testing. The SDQ scales and leisure reading (in the cross-lagged panel models) were modelled as latent variables. For SDQ scales, the five variables were used as indicator variables. For leisure reading, the two items were used as indicator variables.
To verify that these latent variables measured the same constructs over time, we tested for longitudinal measurement invariance by comparing a series of nested factor models, adding specific restrictions to the models 65,66 . In a first step, we confirmed configural invariance, which indicates an equivalent factor structure across time. In a second step, we tested for weak factorial invariance, constraining the factor loadings to be equal over time. In a third step, we tested for strong factorial invariance, also constraining the thresholds to be equal over time. In a final step, strict invariance was tested, establishing that the residual variances were also equivalent across time. Changes in CFI greater than −0.01 and in RMSEA greater than 0.015 were considered a meaningful decrease in model fit 67 . If weak invariance could be established, analyses were conducted with latent variables. In a next step, the variables were also examined for partial measurement invariance 65,68 . If weak invariance could not be established, we had planned to use a parceling approach (i.e., the balancing approach 69 ), but this was not necessary. Parceling has several advantages over alternative methods, such as higher reliability, lower likelihood of distributional violations, and fewer parameter estimates 69 .
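The stepwise comparison of nested invariance models can be sketched as below. The function name `highest_invariance_level` is hypothetical, and the fit values in the example are invented for illustration and are not results from this study. Treating either criterion (CFI drop or RMSEA rise) as sufficient to reject a step is an assumption of this sketch.

```python
def highest_invariance_level(fits, max_cfi_drop=0.01, max_rmsea_rise=0.015):
    """Return the most constrained invariance level that is still supported.

    fits: list of (level_name, cfi, rmsea), ordered from least to most
    constrained (e.g., configural, weak, strong, strict).
    """
    supported = fits[0][0]
    for (_, prev_cfi, prev_rmsea), (name, cfi, rmsea) in zip(fits, fits[1:]):
        cfi_drop = prev_cfi - cfi
        rmsea_rise = rmsea - prev_rmsea
        # Assumption: exceeding either cutoff counts as a meaningful decrease.
        if cfi_drop > max_cfi_drop or rmsea_rise > max_rmsea_rise:
            break
        supported = name
    return supported

# Invented fit values, for illustration only.
example_fits = [
    ("configural", 0.980, 0.030),
    ("weak", 0.978, 0.031),
    ("strong", 0.975, 0.033),
    ("strict", 0.950, 0.045),
]
level = highest_invariance_level(example_fits)
```

In this invented example the step from strong to strict invariance lowers the CFI by more than 0.01, so the sketch stops at strong invariance.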
Hypothesis testing. Examining leisure reading: cross-lagged panel design analyses. To test the hypotheses that leisure reading predicts future prosocial behavior/social adjustment while controlling for earlier prosocial behavior/social adjustment and other control variables (non-verbal intelligence, migration background, socioeconomic status), we modelled the relationship between leisure reading and prosocial behavior/social adjustment in a two-way cross-lagged panel design (see Fig. 1). The analyses were conducted separately for prosocial behavior and social adjustment. Prosocial behavior/social adjustment and leisure reading were treated as latent variables.
Does leisure reading predict future prosocial behavior? We conducted two-way cross-lagged panel design analyses that examined associations between leisure reading (Grade 6 and 9) and prosocial behavior (self-reports; Grade 6 and 9), controlling for non-verbal intelligence, migration background, and socioeconomic status.
Does leisure reading predict future social adjustment? We conducted two-way cross-lagged panel design analyses (analogous to the analysis of prosocial behavior) that examined associations between leisure reading (Grade 6 and 9) and social adjustment (self-reports; Grade 6 and 9), controlling for non-verbal intelligence, migration background, and socioeconomic status.
Examining leisure reading: cumulative analyses. To test the hypotheses that cumulative leisure reading predicts future prosocial behavior/social adjustment while controlling for earlier prosocial behavior/social adjustment and other control variables (non-verbal intelligence, migration background, socioeconomic status), we examined cumulative leisure reading across several grades using SEM (see Fig. 2). The analyses were conducted separately for prosocial behavior and social adjustment. The SDQ scales were treated as latent variables. Cumulative leisure reading was modelled as a manifest variable, averaging all items across Grades 5 to 8 (in analogy to the treatment of cumulative practice in expertise research 70 ).
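The construction of the manifest cumulative reading score (the mean of all leisure reading items across Grades 5 to 8) can be sketched as follows. The function name `cumulative_reading` is hypothetical, and skipping missing responses is an assumption made for this sketch; the analyses described here addressed missing data through multiple imputation instead.

```python
def cumulative_reading(items_by_grade):
    """Average all observed leisure reading item responses across grades.

    items_by_grade: dict mapping grade -> list of item responses,
    with None marking a missing response (assumed handling: skipped).
    """
    observed = [
        response
        for responses in items_by_grade.values()
        for response in responses
        if response is not None
    ]
    if not observed:
        return None  # no information on reading at any grade
    return sum(observed) / len(observed)

# Invented responses for one student, two items per grade.
score = cumulative_reading({5: [2, 2], 6: [3, 3], 7: [4, None], 8: [4, 3]})
```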
Does cumulative reading predict future prosocial behavior? We constructed structural equation models examining associations between leisure reading (mean of Grades 5 to 8) and prosocial behavior (self-reports; Grade 9), controlling for earlier prosocial behavior (self-report; Grade 6), non-verbal intelligence, migration background, and socioeconomic status.
Does cumulative reading predict future social adjustment? We constructed structural equation models that examined associations between leisure reading (mean of Grades 5 to 8) and social adjustment (self-reports; Grade 9), controlling for earlier social adjustment (self-report; Grade 6), non-verbal intelligence, migration background, and socioeconomic status.
Examining the role of genre: cumulative analyses. We examined whether there were differences between genres in effects on prosocial behavior and social adjustment, focusing on classic literature, popular literature, comic books, and nonfiction (see Fig. 3). To do so, we conducted two separate structural equation models: one for prosocial behavior and one for social adjustment. Prosocial behavior/social adjustment (self-reports; Grade 9) was predicted by cumulative reading experience in each of the four genre categories (averaged across Grades 5, 7, and 8), controlling for previous social-cognitive behavior (respective SDQ scales; measured in Grade 6) and other control variables (non-verbal intelligence, migration background, socioeconomic status). Reading genres were treated as manifest variables and averaged across grades. SDQ scales were treated as latent variables.
Does genre differentially relate to future prosocial behavior? We constructed structural equation models examining associations between leisure reading of different genres (mean of Grades 5, 7, and 8) and prosocial behavior (self-reports; Grade 9), controlling for earlier prosocial behavior (self-report; Grade 6), non-verbal intelligence, migration background, and socioeconomic status.
Does genre differentially relate to future social adjustment? We constructed structural equation models examining associations between leisure reading of different genres (mean of Grades 5, 7, and 8) and social adjustment (self-reports; Grade 9), controlling for earlier social adjustment (self-report; Grade 6), non-verbal intelligence, migration background, and socioeconomic status.
Power analysis. Due to the archival nature of these data, an a priori power analysis before data collection was not possible. We performed a power analysis for each type of SEM with the R package simsem (version 0.5-16), which provides a framework for Monte Carlo simulations of structural equation models (script is available on OSF) 71 . As we had little information on the cluster structure of the NEPS data and no knowledge about expected ICCs at level 2 (class) and level 3 (school), we used a simulation without clustering. Using the lavaan option in simsem, we first specified a population model for each of the planned analyses (with and without covariates, separately) that was informed by the previous literature 2,11,24,25,35,72–77 . Then we specified corresponding analysis models. Given a power of 0.95, a significance threshold of 0.05, and a standardized path coefficient of 0.10 between reading and social-cognitive behavior 11,25,26,31 , around 2,350 participants were needed for detecting the path of interest in the cross-lagged model (with and without covariates). Around 2,100 participants were needed for the path of interest in the cumulative model (with and without covariates). Finally, in the cumulative genre model, around 3,900 participants were needed for any of the individual paths of interest (with and without covariates). As suggested by the editorial office, we additionally performed a power analysis with a power of 0.80 for each type of SEM, which is a conventionally used threshold for power calculations in psychological research 78 . Given a power of 0.80, around 1,450 participants were needed for the cross-lagged model, around 1,300 for the cumulative model, and 2,350 for the cumulative genre model.
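The general logic of a Monte Carlo power analysis can be illustrated with a deliberately simplified sketch. It is not the simsem procedure used here: it simulates a single standardized path of 0.10 between two manifest variables, with no latent variables, covariates, or autoregressive paths, and it tests the sample correlation with a Fisher z approximation. The function name `simulated_power` and all default arguments are assumptions of this sketch, so the sample sizes it implies will differ from the SEM-based figures reported above.

```python
import math
import random

def simulated_power(n, beta=0.10, reps=300, z_crit=1.959964, seed=1):
    """Estimate power to detect a standardized path `beta` at sample size `n`.

    Simplified bivariate stand-in for an SEM power simulation:
    y = beta * x + residual, tested two-sided at alpha = 0.05 via Fisher z.
    """
    rng = random.Random(seed)
    resid_sd = math.sqrt(1.0 - beta ** 2)  # keeps var(y) at 1
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [beta * x + resid_sd * rng.gauss(0.0, 1.0) for x in xs]
        mx, my = sum(xs) / n, sum(ys) / n
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        syy = sum((y - my) ** 2 for y in ys)
        r = sxy / math.sqrt(sxx * syy)
        z = math.atanh(r) * math.sqrt(n - 3)  # Fisher z test statistic
        if abs(z) > z_crit:
            hits += 1
    return hits / reps  # proportion of replications with a significant path
```

Calling `simulated_power` over a grid of sample sizes and locating the smallest `n` at which the estimate reaches the target power mirrors, in miniature, how a simulation-based sample size requirement is obtained.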

Exploratory analyses.
In exploratory analyses, we repeated all aforementioned analyses using the parental reports of the SDQ scales instead of the self-reports, which were used for confirmatory hypothesis testing. In addition, we explored the robustness of our findings, by including gender, reading competence, and trait openness to experience as additional control variables in the analyses.
Openness to experience. Our control variable of openness to experience (which was included in additional exploratory analyses to explore the robustness of our findings) was assessed with two self-report items: "I do not care much about arts" and "I have an active imagination, I am an imaginative person." Both items were taken from the Big Five Inventory-10 (BFI-10) 79 and were answered on a five-point Likert scale (1 = does not apply at all to 5 = applies completely). The first item was reverse-scored, and both items were then averaged to obtain a single score ranging from 1 to 5. Openness to experience was assessed via self-reports at Waves 3 and 5, and via parental reports at Waves 3 and 6. The Openness to Experience subscale of the German BFI-10 showed adequate test-retest reliability (r_tt = 0.78) and a high correlation with the standard 9-item scale from the BFI-44 (r = 0.80) 79 . We used the self-report score of the earliest assessment (in Wave 3) as a control variable in the analyses.
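The scoring of the two openness items can be sketched as follows. The function name `openness_score` is hypothetical; reversing a response on a 1 to 5 scale as 6 minus the response is the standard recoding and is assumed to match what was done here.

```python
def openness_score(arts_item, imagination_item):
    """Score openness from the two BFI-10 items (each answered 1-5).

    The arts item ("I do not care much about arts") is negatively keyed,
    so it is reversed (6 - response) before the two items are averaged.
    """
    for response in (arts_item, imagination_item):
        if not 1 <= response <= 5:
            raise ValueError("responses must lie between 1 and 5")
    reversed_arts = 6 - arts_item
    return (reversed_arts + imagination_item) / 2
```

A respondent who fully rejects the arts item (1) and fully endorses the imagination item (5) therefore receives the maximum score of 5.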
Reading competence. Our control variable of reading competence (which was included in additional exploratory analyses to explore the robustness of our findings) was measured in this dataset using a task constructed by the NEPS research group. It consists of five different text types: informational, commenting or argumenting, literary, instructional, and advertising. Test items for each kind of text rely on different cognitive processes (e.g., finding information, drawing conclusions) 80 . Three different item formats were employed: (1) multiple-choice items; (2) true or false statements; and (3) matching tasks, where a partial title must be assigned to the appropriate section of a text (as an example). A partial credit model was used for scaling the data and manifest scale scores are provided in the form of weighted likelihood estimates 81 . Reading competence was assessed at Wave